THE DREAM TEAM



Analysis of Charting Data for Social Movements
December 9th 2019
Group Members:
Brandon Herren (bsh46), Milan Champion (mtc106), and Pomaikai Canaday (pmc101)
Introduction:
Background references:
Loersch, C., & Arbuckle, N. L. (2013). Unraveling the mystery of music: Music as an evolved group process. Journal of Personality and Social Psychology, 105(5), 777-798.
Eyerman, R., & Jamison, A. (1998). Music and social movements: Mobilizing traditions in the twentieth century. Cambridge University Press.
For our project, we conducted a thorough analysis of popular music over the last 55 years and how it has coincided with social movements. We were curious as to whether music reflected the social change occurring at the time and how, in turn, social change might affect music and lyrics. We believe that such an analysis would prove significant for listeners, artists, sociologists, and anyone looking to analyze the trends that influence popular music. As a group, we looked at many different attributes of the music, from the gender of the artists, to genre, and finally to lyrics. Our goal was to categorize and analyze the impact that cultural phenomena of the time had on music.
We believe that the results of our findings can be applied on a wide scale, from informing which artists are signed to predicting the popularity and chart rankings of future music. It will be interesting to see whether such an analysis leads to an increase or decrease in music diversity, from both a producer and a consumer standpoint, in terms of gender, lyrics, and musical genre.
Background:
Genre:
For each of these songs, we performed API calls using a music rating website called Last.fm, where users tag songs with associated words, many of which are musical genres of the song. We considered other sources for genre information, such as Wikipedia entries on the songs or a number of other databases, but decided against it, due to the lack of genre name consistency and number of missing songs. We were able to pull acceptable tagging data for 5,822 out of 5,900 songs (98.7%). The remaining songs were relatively evenly distributed across the time period, so there was likely minimal sampling bias.
In order to remove noise from this genre data, we created measures of genre correlation for 14 separate categories: pop, rock, indie, alternative, metal, soul, R&B, hip hop, rap, jazz, electronic, dance, country, and folk. From there, we took the sum of tags associated with each of these categories for each song, so that each song has a weighted rating of association for each genre, which we felt was the best approach for representing multi-genre songs.
Lyrics:
In addition to genre classification, we pulled lyrics for each song from Fandom data, which we later used to perform the sentiment analysis detailed below.
Gender:
We used the MusicBrainz API to get gender data for artists, which we felt would be pertinent for our study of the evolution of musical diversity. It provided information on which artists were groups and which were individuals, and whether the individuals were male, female, or identified otherwise.
Data Cleaning:
Basic Chart Data:
There was relatively little cleaning necessary for this section of our data, as it came pre-formatted from Wikipedia. There was a scraping error in which two years were accidentally duplicated in the early part of our analysis, which we later rectified. Most of the cleaning we needed involved song and artist names, as the convention for including featured artists was not consistent across the dataset, with (feat. other artist) appearing in some form in either the song title or the artist name. Given that Last.fm, Fandom, and MusicBrainz each had their own conventions for this, we had to convert each of the incorrect names to the proper formatting, which we were able to do quite successfully with a set of conversion rules, running API calls for each candidate format to see which succeeded.
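The fallback idea can be sketched as follows. The function name and the exact rules here are hypothetical; our actual cleaning script used its own, larger rule set, trying each candidate format against the API until one returned a hit.

```python
import re

def name_variants(artist, title):
    """Generate candidate (artist, title) formats to try, in order,
    against each API until one returns a hit. Hypothetical sketch."""
    variants = [(artist, title)]
    # "Song (feat. Other)" -> move the credit into the artist field,
    # or drop it entirely
    m = re.match(r"(?i)^(.*?)\s*\(feat\.\s*(.*?)\)\s*$", title)
    if m:
        variants.append((artist + " feat. " + m.group(2), m.group(1)))
        variants.append((artist, m.group(1)))
    # "Artist feat./featuring Other" -> try the lead artist alone
    m = re.match(r"(?i)^(.*?)\s+(?:feat\.|featuring)\s+.*$", artist)
    if m:
        variants.append((m.group(1), title))
    return variants
```

Each variant would then be passed to the relevant API call, stopping at the first format that returned data.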
Genre:
The Last.fm tag data contained a large amount of noise, as many users either incorrectly tagged a song or tagged it with information irrelevant to genre, such as the artist the song belonged to or a TV show the song appeared in. Given that each tag had a weighted value associated with it, we removed all tags with a weight of less than five (where 100 is the weight of a song's most-applied tag). This helped remove miscellaneous or irrelevant data points. To create the genre subcategories, we combined all tags that contained any version of the genre's base word. For example, the rock subcategory includes tags like alternative rock, psychedelic rock, indie rock, etc., which we felt was the most effective and objective way to associate tags with a broad category. Some tags may be associated with multiple categories, such as pop rock.
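A minimal sketch of this filter-and-group step, assuming the 14 categories listed later in this report (the function name is hypothetical):

```python
GENRE_CATEGORIES = ["rock", "pop", "indie", "alternative", "soul", "r&b",
                    "hip hop", "rap", "dance", "electronic", "folk",
                    "jazz", "country", "metal"]

def clean_and_group(tags, min_weight=5):
    """Drop tags with weight < min_weight, then add each surviving tag's
    weight to every category whose base word it contains (so 'pop rock'
    counts toward both pop and rock)."""
    grouped = {}
    for tag, weight in tags.items():
        if weight < min_weight:
            continue
        for category in GENRE_CATEGORIES:
            if category in tag.lower():
                grouped[category] = grouped.get(category, 0) + weight
    return grouped
```

For instance, `{"pop rock": 100, "classic rock": 40, "disney": 3}` drops the low-weight "disney" tag and credits "pop rock" to both categories.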
Lyrics:
For the lyric data, cleaning was relatively easy. For the most part, we removed songs that had null lyrics or that we could not find in the database.
Data Description and Graphs:
Genre Data:
As mentioned above, we acquired genre tag data for 5,822 of the songs, each in the format of a dictionary whose keys are the tag names and whose values are the relative weight of how much the song had been tagged with that key.
EX. {rock: 100, pop: 78, dance: 12}
This data had plenty of small pitfalls: being user-tagged, plenty of song tags were relatively nonsensical or irrelevant to genre (including a shocking number of songs tagged “disney”). In order to categorize the tags into more specific genre subcategories, we created, based on the most common tags in the set, the following 14 broader genre categories:
[rock, pop, indie, alternative, soul, r&b, hip hop, rap, dance, electronic, folk, jazz, country, metal]
From there, we calculated genre “overlap ratios” for each of the individual tags, for instance, for the tag “indie rock” - we looked at how much overlap there was between that tag (in each song it appeared in) and each category of genres (see findGenreCorrelation). Based on the overlap for the entire data set, we calculated the overall overlap value for each tag, which is then saved into a file titled: genre_correlation_11-06-2019_170429.csv.
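A simplified stand-in for the idea behind findGenreCorrelation (the real implementation worked with tag weights rather than simple counts, so treat this as a sketch):

```python
def genre_overlap(tag, song_tag_sets, categories):
    """For each category, the fraction of songs carrying `tag` whose tag
    set also contains some tag with that category's base word in it."""
    with_tag = [tags for tags in song_tag_sets if tag in tags]
    if not with_tag:
        return {c: 0.0 for c in categories}
    return {c: sum(any(c in t for t in tags) for tags in with_tag) / len(with_tag)
            for c in categories}
```

So a tag like "indie rock" gets a high overlap ratio with the rock category (every song carrying it also carries a rock-related tag, including the tag itself) and a lower ratio with categories it only sometimes co-occurs with.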
We also performed some basic statistical analysis for each of the tags, finding the mean/standard deviation/spread of the release years for each song that included the given tag. These matched our expectations (who would’ve guessed that the ‘80s’ tag has its mean release year… in the 80’s!).
Genre Correlation Analysis:
Using these new overlap values, we calculated the correlations between each of the ratios, which produced values similar to what we expected. Rock correlated with metal/alternative/indie, hip hop correlated with rap/R&B, while rock and hip hop themselves were quite strongly inversely correlated. This makes sense given our broad understanding of musical genres, as those two specifically are generally not considered musically similar. It also speaks to possible social trends, as rock is primarily associated (despite its mixed origins) with white artists, whereas hip hop is associated with African-American artists. In many ways, the separation of genres speaks to the ongoing segregation of music. Here were the overall results:
(clearer output is displayed in genreClustering.py)
Also in heatmap form:
The histograms provide some insight into the distribution of each of the genre ratios. Pop is by far the most common, with the highest proportion of ratios above 0.5. Other popular genres, such as rock, R&B, and soul, all have smaller niches that are highly matched to them. Conversely, genres like jazz, electronic, and metal, which are even smaller musical niches, have a large proportion of ratios below 0.1. In a way, this reflects the dearth of musical diversity in the most popular songs, with pop being consistently dominant.
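Each cell of the correlation heatmap reduces to Pearson's r between two genre-ratio columns; a minimal standard-library sketch of that computation:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns
    of genre ratios."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

An inverse relationship like rock vs. hip hop shows up as a value near -1, while related pairs like rock and metal land near +1.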
Cluster Analysis:
In order to further test the performance of the genre ratio values, we attempted to cluster the various tags based on those values. Naturally, there were some issues with this, as clustering 14-dimensional values did not lend itself to very high silhouette values. Prior to k-means clustering, we built a baseline classification by simply taking the single highest genre association for each tag and placing the tag in that genre's cluster (making 14 clusters total). The following scatter plot was produced from those results.
We then visualized the clusters using PCA, given that raw 2-D scatter plots weren't exactly the most enlightening for our 14 dimensions of data, which produced the following results.
For the initial genre classifications:
This provides a much better overview of the genres and their differences, with pop (in orange), unsurprisingly occupying the middle space of the genres, while rock, soul, and r&b all push off in separate directions clustered densely together.
Here were the PCA visualizations for the k-means clusters:
Using the PCA visualizations, we can see that despite n=2's higher silhouette value, n=4 seems to perform better at classifying tags by genre, especially when compared to the initial classification clusters above. n=14 appears extremely ineffective, despite that being the number of genre categories in the data set, suggesting that k-means could not fully parse this situation, as not all genres are of equal size and density. Nevertheless, our overall classification system seems fairly effective at categorizing songs into separate clusters by genre.
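The k-means, silhouette, and PCA steps above can be sketched as follows. The data here is a synthetic stand-in for our real tag-by-genre ratio matrix (rows = tags, columns = the 14 genre ratios), so the numbers are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: three loose groups of 20 "tags" in 14 dimensions
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(20, 14))
               for c in (0.1, 0.5, 0.9)])

scores = {}
labels = {}
for n in (2, 4, 14):
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X)
    labels[n] = km.labels_
    scores[n] = silhouette_score(X, km.labels_)

# Project to 2-D for plotting, colored by cluster label
coords = PCA(n_components=2).fit_transform(X)
```

As in our analysis, the silhouette score alone can favor a small n even when a larger n matches the underlying genre structure better, which is why we inspected the PCA plots directly.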
Trends in Music Diversity over Time
In order to analyze any increase or decrease in musical diversity by genre over time, we graphed the average weight percentage of each genre group for all of the songs in each year, and plotted these values on a line graph.
This graph shows that popular music has actually become more diverse over the past few decades, as through 1990, pop and rock dominated the charts, combining for some 60% of tag weights. Since then, as hip hop, rap, alternative, and indie have grown as individual forces, that number has plummeted to some combined 25%, with rock, a traditionally white genre, seeing an especially large decline. Nevertheless, this graph also shows that in some ways, musical diversity has declined, with some older genres falling out of favor entirely, such as jazz and folk, which had a notable niche in the 1960s and 70s.
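The per-year averaging behind this line graph can be sketched as follows (the helper name is hypothetical; one line per genre, one point per year):

```python
from collections import defaultdict

def yearly_genre_means(rows, categories):
    """rows: (year, {category: weight share}) pairs.
    Returns {category: {year: mean share}}, ready to plot as one line
    per genre across the years."""
    by_year = defaultdict(list)
    for year, shares in rows:
        by_year[year].append(shares)
    return {c: {year: sum(s.get(c, 0.0) for s in songs) / len(songs)
                for year, songs in by_year.items()}
            for c in categories}
```

Songs missing a category simply contribute zero to that genre's average for the year, which keeps the per-year means comparable across genres.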
Gender by Genre/Group Status
We also averaged the genre weight percentages divided into solo male artists, solo female artists, and groups, in order to determine which genres were more associated with different genders and what that said about underlying societal biases and trends.
Using this analysis, we determined that solo male artists were more likely to incorporate rock, hip hop, rap, and country influences, while solo female artists were more likely to perform pop, R&B, and dance music. The male genres, especially rock and rap, are associated with more aggressive, loud music, so in some ways it was not surprising that men were more likely to use these influences given gender stereotypes. Conversely, women performing “softer” genres such as pop and dance reflects opposing gender stereotypes about women. Groups were especially likely to perform rock and alternative music, which, although we were unable to get gender specific information for groups, highly suggests that many of these groups are likely entirely male. This has its own social implications, as male groups, for whatever reasons, are more likely to form than female ones. This could be a reflection of the greater level of instrumentation in some of these male-associated genres, requiring more band members in order to play the instruments.
Gender Overall and Gender Per Year Analysis
We created a pie chart of gender frequency from 1963 to 2018. We found that unknown, our label for groups, had the highest percentage overall (50%). This means that throughout the last 55 years, groups were the most prevalent artists.
Next, we looked at the most common gender per year. (This is an interactive graph.) We found that, for most years, groups dominate the music industry. As we get closer to 2018, more and more independent artists begin to move up in the charts. This is indicative of a move away from more group-centered trends. One reason could be the increasing availability of technology in the music industry that allows solo artists to use digitized instruments instead of hiring people who can play the instruments live.
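The aggregation behind the pie chart and the per-year graph can be sketched as follows (the helper name is hypothetical):

```python
from collections import Counter, defaultdict

def gender_breakdown(rows):
    """rows: (year, gender) pairs with gender in {'male', 'female', 'unknown'}.
    Returns overall shares (for the pie chart) and per-year counts
    (for the per-year graph)."""
    overall = Counter(gender for _, gender in rows)
    total = sum(overall.values())
    shares = {g: count / total for g, count in overall.items()}
    per_year = defaultdict(Counter)
    for year, gender in rows:
        per_year[year][gender] += 1
    return shares, per_year
```

The most common gender in a given year is then just `per_year[year].most_common(1)`.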
Artist Level Analysis
We also used these genre categories to perform a number of individual artist level analyses on artists that we thought might be interesting or who were especially relevant on a societal level.
Taylor Swift
Taylor Swift’s graph reflects her individual genre evolution from initially having large country influences to being more poppy and dance/R&B-based. In many ways, this reflects larger societal trends, as country music has gone through long droughts on Billboard charts, which is perhaps a reflection of country artists outgrowing their roots and engaging in pop music. This has larger implications beyond country, suggesting that as artists gain fame, they tend to move in a poppier direction, contributing to the decline in musical diversity.
The Beatles
For other artists, our methods of analysis were less conclusive. For the Beatles, arguably the most influential artist of the last half century, rock and pop remained nearly exactly tied, constantly comprising some 75% of the total tags. While this accurately reflects that the Beatles consistently adapted pop and rock influences, it fails to show the diversity of those influences, as they all occurred in subcategories of pop and rock that we simply grouped together during our data cleaning. As such, it doesn't provide as many insights in some cases as we had hoped.
Analysis of Data:
Predictive Analysis
Hypothesis 1: Genre data can be used to predict the relative release year of each song.
In order to pursue this hypothesis, we used Naive Bayes and Random Forest classifiers to predict if each song was released before or after 1990. The following data was used for prediction: title, artist, and each of the genre ‘weight’ values, calculated based on our genre overlap ratios for each tag multiplied by the weight of that tag in the initial genre dictionary for that song (see: getEstimatedGenreOverlapBySong). We first created a class column based on the actual true/false value of the song occurring before or after 1990. From there, we dropped the year column, and performed a train-test split of 80-20. The following results were produced:
For NAIVE BAYES (Gaussian):
GaussianNB: 0.872111 (0.015731)
0.8746594005449592
[[459 119]
[ 19 504]]
precision recall f1-score support
0 0.96 0.79 0.87 578
1 0.81 0.96 0.88 523
accuracy 0.87 1101
macro avg 0.88 0.88 0.87 1101
weighted avg 0.89 0.87 0.87 1101
For RANDOM FOREST:
Random Forest: 0.954571 (0.009340)
0.9600363306085377
[[555 23]
[ 21 502]]
precision recall f1-score support
0 0.96 0.96 0.96 578
1 0.96 0.96 0.96 523
accuracy 0.96 1101
macro avg 0.96 0.96 0.96 1101
weighted avg 0.96 0.96 0.96 1101
As such, both performed quite well, with Random Forest being 96% accurate and Naive Bayes being 87% accurate, which for the most part confirmed our hypothesis. The fact that Random Forest was more accurate than Naive Bayes is not surprising, given the large number of attributes in our analysis, which likely contributed to more complicated trees. It is fairly impressive that both were so accurate; it could reflect a lack of musical diversity that the presence of individual genres can so accurately predict the period a song comes from.
Hypothesis 2: Can we predict whether a song was in the top 20 of the year (rank) based on the artist, year, and song?
In order to pursue this hypothesis, we used all of the classifiers listed on the project description, Decision Tree (CART), KNeighbors (KNN), Naive Bayes (GNB), and Random Forest (RFC), to predict whether a given song was in the top 20 of the year it was released, based on the following data: artist, year, and song title. We first created a class column based on the actual true/false value of whether the song was in the top twenty of the year. From there, we dropped the rank column and performed a train-test split of 80-20. The following results were produced:
Validation: KNN
0.7821428571428571
[[863 42]
[202 13]]
precision recall f1-score support
0 0.81 0.95 0.88 905
1 0.24 0.06 0.10 215
accuracy 0.78 1120
macro avg 0.52 0.51 0.49 1120
weighted avg 0.70 0.78 0.73 1120
Validation: CART
0.69375
[[722 183]
[160 55]]
precision recall f1-score support
0 0.82 0.80 0.81 905
1 0.23 0.26 0.24 215
accuracy 0.69 1120
macro avg 0.52 0.53 0.53 1120
weighted avg 0.71 0.69 0.70 1120
Validation: GNB
0.69375
[[722 183]
[160 55]]
precision recall f1-score support
0 0.82 0.80 0.81 905
1 0.23 0.26 0.24 215
accuracy 0.69 1120
macro avg 0.52 0.53 0.53 1120
weighted avg 0.71 0.69 0.70 1120
Validation: RFC
0.69375
[[722 183]
[160 55]]
precision recall f1-score support
0 0.82 0.80 0.81 905
1 0.23 0.26 0.24 215
accuracy 0.69 1120
macro avg 0.52 0.53 0.53 1120
weighted avg 0.71 0.69 0.70 1120
As such, all performed okay, with KNN being 78% accurate and CART, GNB, and RFC each being 69% accurate. With these numbers, we cannot positively support the hypothesis. We were not surprised by these results, since we used only three attributes for prediction, which is a relatively small set for making a good prediction.
Hypothesis 3: Based on the artist, song, rank, and year, can we predict whether the artist was a band/multiple artists?
In order to pursue this hypothesis, we used all of the classifiers listed on the project description, Decision Tree (CART), KNeighbors (KNN), Naive Bayes (GNB), and Random Forest (RFC), to predict whether the song was sung by a single artist or by a band/multiple artists, based on the artist (this gives no indication of gender), song, rank, and year. We first created a class column based on the actual true/false value of whether the artist was described as band/multiple or as a single gender (male or female). From there, we dropped the gender column and performed a train-test split of 80-20. The following results were produced:
Validation: KNN
0.6240474174428451
[[261 255]
[189 476]]
precision recall f1-score support
0 0.58 0.51 0.54 516
1 0.65 0.72 0.68 665
accuracy 0.62 1181
macro avg 0.62 0.61 0.61 1181
weighted avg 0.62 0.62 0.62 1181
Validation: CART
0.79424216765453
[[415 101]
[142 523]]
precision recall f1-score support
0 0.75 0.80 0.77 516
1 0.84 0.79 0.81 665
accuracy 0.79 1181
macro avg 0.79 0.80 0.79 1181
weighted avg 0.80 0.79 0.79 1181
Validation: GNB
0.6054191363251482
[[258 258]
[208 457]]
precision recall f1-score support
0 0.55 0.50 0.53 516
1 0.64 0.69 0.66 665
accuracy 0.61 1181
macro avg 0.60 0.59 0.59 1181
weighted avg 0.60 0.61 0.60 1181
Validation: RFC
0.7256562235393734
[[359 157]
[167 498]]
precision recall f1-score support
0 0.68 0.70 0.69 516
1 0.76 0.75 0.75 665
accuracy 0.73 1181
macro avg 0.72 0.72 0.72 1181
weighted avg 0.73 0.73 0.73 1181
As such, all performed okay, with CART being 79% accurate, RFC being 73% accurate, KNN being 62% accurate, and GNB being 61% accurate. With these numbers, we cannot positively support the hypothesis. We were not surprised by these results, since we used only four attributes for prediction, which is a relatively small set for making a good prediction.
Hypothesis 4: R&B and Rock generally have the same tags and so it can be assumed that a song under the genre of R&B will also be under the genre of Rock.
In order to test this hypothesis, we used a t-test on our genre correlation data. It was difficult to find data that fell under a normal distribution, so we used rock and R&B, which we believed could still yield a meaningful comparison. Using the column for r&b exact, as well as the column for rock_exact, we ran a two-sample t-test. With a p-value of about 0.76, the test found no statistically significant difference between the means of the two columns; however, a similar mean alone does not establish that songs tagged R&B are also tagged rock, so the t-test did not support our hypothesis.
Ttest_indResult(statistic=-0.3077861198415526, pvalue=0.7582847964168427)
We also used a linear regression to examine the relationship between R&B and rock. While the scatter doesn't appear absolutely random, there isn't a strong correlation between the data points, and what relationship exists is negative: the regressor intercept was about 0.45 and the regressor coefficient was -0.57, meaning songs rated more strongly as R&B tended to be rated less strongly as rock.
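The two tests can be reproduced as follows. The arrays here are synthetic stand-ins for the r&b exact and rock_exact columns (the second is generated to mimic the fitted intercept and slope we reported), so the p-value will differ from the one above:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
rnb = rng.random(200)                                  # stand-in r&b ratios
rock = 0.45 - 0.57 * rnb + rng.normal(0.0, 0.1, 200)   # mimics the fitted line

t_stat, p_value = ttest_ind(rnb, rock)                 # two-sample t-test
reg = LinearRegression().fit(rnb.reshape(-1, 1), rock)  # slope and intercept
```

`reg.intercept_` and `reg.coef_[0]` correspond to the intercept and coefficient quoted above.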
Sentiment Analysis:
As part of our data collection, we gathered lyrics for songs that appeared in the top 100 between 1963 and 2018. The idea was to run a sentiment analysis to get a sense of the sentiment of the top 100 songs in a given year. The analysis was done using TextBlob and NLTK to predict sentiment polarity, a score between -1 and 1, where values from -1 to 0 represent negative sentiment, 0 represents neutral, and values from 0 to 1 represent positive sentiment. The following images show the overall sentiment of the lyrics for all years and the distribution of sentiment polarity scores.
Looking at this graph, we can see that, overall, most songs between 1963 and 2018 tend to have a positive sentiment rather than a negative or neutral one. To our group, this was surprising, as it has been found that “Listening to sad songs that have no direct relevance to our lives allows us to vent sadness in a safe context with no real-life implications or consequences”. We could attribute this to the limits of NLTK: natural language processing is very difficult because human speech is so varied, and the toolkit may have misinterpreted a song's meaning. “Love”, for instance, reads as a happy word, but many sad songs use it, love being one of our strongest feelings. Another thing to note is that sentiment is clumped into three categories while the actual scores span a much larger range. This is due to the generality of the approach, which is meant to give a broad rather than fine-grained sentiment, as natural language processors are not that advanced yet.
Reference: https://www.psychologytoday.com/us/blog/longing-nostalgia/201501/why-we-love-sad-songs
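The bucketing applied to the polarity scores is simple; a sketch (with TextBlob, the input would come from `TextBlob(lyrics).sentiment.polarity`, which returns a float in [-1, 1]):

```python
def sentiment_bucket(polarity):
    """Map a polarity score in [-1, 1] to the three buckets used in our
    graphs: negative, neutral, or positive."""
    if polarity < 0:
        return "negative"
    if polarity == 0:
        return "neutral"
    return "positive"
```

This is exactly the clumping noted above: a mildly positive 0.05 and a strongly positive 0.9 land in the same bucket, which flattens the underlying distribution of scores.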
Ethical considerations:
We collected data from the Last.fm API, Billboard's public song rankings, MusicBrainz, and lyrics from a lyrics website. These were all public sites with data that was given voluntarily, so we don't believe there were any ethical privacy violations in collecting this data. Our analysis was conducted on song rankings, artists' gender, and lyrical data. We also considered how we collected the data and what that could mean for researchers attempting to use our analysis in the future.
We determined that our analysis could potentially affect the people who listen to music, sociologists and the way they analyze social movements, other data scientists looking to use the information we have gathered, historians trying to determine the popularity of social movements and how those movements radiate through the rest of society, and professionals in the music industry looking to alter/dominate the future of music production.
One of the data attributes that we found to be slightly controversial was gender. The only options we provided were male, female, or unknown (for groups). This is consistent with binary gender norms, which we recognize some artists may not fall under.
Another consideration was that of the music rankings themselves. While the information there isn't controversial, there are many biases behind the rankings themselves. Especially for our specific analysis on social movements, Billboard's top 100 doesn't really include more niche genres of music. Specifically for this project, that could potentially exclude social movements that are more underground but still have a profound impact on society. They are most likely entirely excluded from this analysis.
Conclusions:
During the 60's, it wasn't easy to listen to your favorite artists or diversify your music tastes. Most people listened to music on the radio, live in concert, on television, on records, or occasionally on tape. This meant that the power to determine what people listened to lay with the people who owned and operated radio and television stations. If you liked a song, it was difficult to listen to it outside of these specific mediums. Furthermore, there were systematic barriers in place that prevented certain genres of music from becoming more popular because they were associated with certain races during a period in which racism was prevalent and condoned. So, even though the 60's were a time in which much of American society was changing, from the Civil Rights Movement to the women's movement to the gay rights and environmental movements, there isn't a sufficient record of it in the music industry, because producers held more control over which songs were heard, and discrimination kept minority groups, and the genres associated with them, from becoming popular. Similarly, women were left largely out of the mainstream media.
The 70's saw a decrease in soul and an increase in dance music in its later years, while everything else remained largely the same. The 70's were a period of cultural revolution for the United States, but also largely a continuation of the 60's. The political turmoil of the time (the Watergate scandal, backlash by conservatives against progressive movements) made many people turn away from politics and toward pop culture. Toward the end of the decade, young people were using this newfound freedom to explore everything from what they wore to drugs and sex. We believe the increase in dance music is indicative of this trend, because people were finding expression through music and movement outside of politics. The 1970's also saw a decrease in group artists and a slight increase in women artists.
Reference: https://www.history.com/topics/1970s/1970s-1
The 80's was a new period of materialism and conservatism in the United States. The counterculture of the previous decades turned many people towards this new attitude. According to our analysis, the 80's was marked by a spike in pop music, and a decrease in rock and soul, while everything else remained fairly similar. For popular culture, the most revolutionary new network was MTV, which televised popular music videos. MTV had a huge influence on the music that was popularized and sold (artists like Madonna!). As the decade continued, MTV became a platform for those who were dissatisfied with conservative thought at the time; it became a place for minority groups and niche genres to share their stories and artistic expression. However, as we mentioned earlier, these niche movements didn't become as publicized, especially in something like the Billboard Top 100.
Reference: https://www.history.com/topics/1980s/1980s-1
The 90's seemed to be where the most fluctuation in music popularity occurred. Pop and rock music decreased substantially, while R&B, rap, and hip-hop increased. This was a period that saw the rise of the internet and discovered the unlimited potential it possessed. Because people could now listen to music almost instantaneously, consumers had an increased ability to determine what became popular. While it isn't apparent in our analysis, the increase in grunge allowed for the normalization of what was considered "countercultural". There was a substantial increase in black urban music, which is reflected in our graph on genre. The end of the 90's also saw an increase in solo artists (specifically women) and a decrease in groups and bands.
Reference: https://www.udiscovermusic.com/stories/90s-music/
The 2000's continued many of the trends that began in the 90's. However, after the 9/11 attack in New York, artists tried to reflect a feeling of optimism while also acknowledging the nationwide grief felt at the time. The internet made way for unlimited and unprecedented access to music for consumers, as well as the ability for smaller artists to upload songs without the need for record labels. Hip-hop and rap were on the rise, along with smaller increases in electronic music. Rock continued to decline, as did soul and dance music. With the innovation and pervasiveness of technology, music began to incorporate and reflect these changes on such a wide scale that they revolutionized the music industry.
Reference: http://www.thepeoplehistory.com/music.html
https://en.wikipedia.org/wiki/2000s_(decade)#Music
Overall, we found that music does indeed reflect the social movements that are happening during the time period. However, it became clear to us that the reflection is only prevalent during the recent decades, in which consumers had greater power over the music they listened to and greater access to a wider variety of genres and artists. Smaller artists were also able to upload music directly to the internet and disperse it to a wide audience without the need for a record label. This revolutionary new ability allowed for wider diversity in the music industry and reflected, on a greater scale, the movements happening among the people.